The Representation of Document Structure : A Generic Object - Process
نویسندگان
چکیده
Document understanding has attained a level of maturity that requires migration from ad-hoc experimental systems, each of which employs its own set of assumptions and terms, into a solid, standard frame of reference, with generic deenitions that are agreed upon by the document understanding community. The logical structure of a document conveys semantic information that is beyond the document's character string contents. To capture this additional semantics, document understanding must relate the document's physical layout to its logical structure. This work provides a formal deenition of the logical structure of text-intensive documents. A generic framework using a hierarchy of textons is described for the interpretation of any text-intensive document's logical structure. The recursive deenition of textons provides a powerful and exible tool that is not restricted by the size or complexity of the document. Frames are analogously used as recursive constructs for the physical structure description. To facilitate the reverse engineering process which is required to derive the logical structure , we describe DAFS, a Document Attribute Format Speciication, and demonstrate how our framework can serve as a conceptual framework for enhancements of DAFS.
منابع مشابه
Capturing Outlines of Planar Generic Images by Simultaneous Curve Fitting and Sub-division
In this paper, a new technique has been designed to capture the outline of 2D shapes using cubic B´ezier curves. The proposed technique avoids the traditional method of optimizing the global squared fitting error and emphasizes the local control of data points. A maximum error has been determined to preserve the absolute fitting error less than a criterion and it administers the process of curv...
متن کاملA Joint Semantic Vector Representation Model for Text Clustering and Classification
Text clustering and classification are two main tasks of text mining. Feature selection plays the key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing semantic concepts of text motivated researches to use...
متن کاملSegmentation Assisted Object Distinction for Direct Volume Rendering
Ray Casting is a direct volume rendering technique for visualizing 3D arrays of sampled data. It has vital applications in medical and biological imaging. Nevertheless, it is inherently open to cluttered classification results. It suffers from overlapping transfer function values and lacks a sufficiently powerful voxel parsing mechanism for object distinction. In this work, we are proposing an ...
متن کاملخوشهبندی اسناد مبتنی بر آنتولوژی و رویکرد فازی
Data mining, also known as knowledge discovery in database, is the process to discover unknown knowledge from a large amount of data. Text mining is to apply data mining techniques to extract knowledge from unstructured text. Text clustering is one of important techniques of text mining, which is the unsupervised classification of similar documents into different groups. The most important step...
متن کاملUsing a Novel Concept of Potential Pixel Energy for Object Tracking
Abstract In this paper, we propose a new method for kernel based object tracking which tracks the complete non rigid object. Definition the union image blob and mapping it to a new representation which we named as potential pixels matrix are the main part of tracking algorithm. The union image blob is constructed by expanding the previous object region based on the histogram feature. The pote...
متن کامل